import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings
import plotly.express as px  # required by the px.bar call later in this section
This groups the data by COVID period and subreddit, then calculates the mean compound sentiment score for each combination. The compound score comes from VADER sentiment analysis and ranges from -1 (most negative) to +1 (most positive). The code then converts the period to an ordered categorical variable so that visualizations follow chronological order.
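The aggregation code itself is not shown in this write-up. A minimal sketch of the step just described, assuming a DataFrame named `reddit_sent_df` with `covid_period` and `subreddit` columns (both used later in this notebook) and a VADER score column named `compound` (an assumed name), might look like this. The toy values are illustrative only, not real data.

```python
import pandas as pd

# Toy stand-in for reddit_sent_df; values are illustrative, not real data.
reddit_sent_df = pd.DataFrame({
    "covid_period": ["Post-COVID", "Pre-COVID", "During COVID", "Pre-COVID"],
    "subreddit": ["Anxiety", "Anxiety", "Anxiety", "Anxiety"],
    "compound": [-0.4, -0.2, -0.3, 0.0],  # "compound" is an assumed column name
})

period_order = ["Pre-COVID", "During COVID", "Post-COVID"]

# Mean compound score for each (period, subreddit) combination.
mean_sentiment = (
    reddit_sent_df
    .groupby(["covid_period", "subreddit"], as_index=False)["compound"]
    .mean()
)

# Ordered categorical so sorting and plotting follow chronology, not the alphabet.
mean_sentiment["covid_period"] = pd.Categorical(
    mean_sentiment["covid_period"], categories=period_order, ordered=True
)
mean_sentiment = mean_sentiment.sort_values("covid_period").reset_index(drop=True)
print(mean_sentiment)
```

Without the ordered categorical, the periods would sort alphabetically (During, Post, Pre), which scrambles the timeline on a chart axis.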
Creates an interactive grouped bar chart using Plotly Express. Each subreddit is represented by a different color, and bars are grouped by COVID period on the x-axis.
All three subreddits show negative sentiment scores even before COVID, which is expected given they’re mental health support communities. r/mentalhealth is slightly less negative, possibly because it’s more focused on general wellness and recovery rather than specific disorders.
All subreddits saw sentiment decline during COVID, with r/Anxiety experiencing the largest drop (from -0.14 to -0.23, a 64% increase in negativity). This aligns with research showing COVID-19 disproportionately impacted people with anxiety disorders due to uncertainty, isolation, and health fears. r/depression also worsened but less dramatically, while r/mentalhealth maintained its position as the least negative space.
This is the most alarming finding: sentiment continued to deteriorate post-COVID rather than recovering. r/Anxiety hit its lowest point at -0.30, representing a 114% increase in negativity from pre-COVID. r/depression also reached its nadir. This suggests the mental health crisis intensified after the acute pandemic phase, possibly due to accumulated trauma, ongoing disruption, or delayed mental health consequences.
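The percentage figures quoted above can be reproduced from the mean scores. The sketch below treats "increase in negativity" as the relative change in the magnitude of the mean compound score, which is an assumption about how the figures were derived; the helper name is hypothetical.

```python
def pct_increase_in_negativity(baseline: float, later: float) -> float:
    """Percent change in the magnitude of a negative mean score."""
    return (abs(later) - abs(baseline)) / abs(baseline) * 100


# r/Anxiety mean compound scores quoted in the text above.
pre, during, post = -0.14, -0.23, -0.30

during_change = pct_increase_in_negativity(pre, during)
post_change = pct_increase_in_negativity(pre, post)

print(f"{during_change:.0f}%")  # 64%
print(f"{post_change:.0f}%")    # 114%
```

This ratio is only meaningful when both scores are negative and the baseline is not close to zero; small baselines inflate the percentage.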
Keywords by Subreddit Analysis
stressor_terms = {
    'health_anxiety': ['heart', 'symptoms', 'panic attack', 'panic attacks', 'scared', 'pain', 'health', 'anxious', 'attack'],
    'work_stress': ['job', 'home', 'house', 'wfh', 'remote', 'work'],
    'school_stress': ['school', 'parents', 'mom', 'dad', 'remote school', 'class', 'online class'],
    'burnout': ['tired', 'anymore', 'hate', 'exhausted', 'fucking tired', 'end'],
    'therapy': ['therapist', 'therapy', 'counseling', 'telehealth', 'find help']
}
periods = ['Pre-COVID', 'During COVID', 'Post-COVID']
subreddits = reddit_sent_df['subreddit'].unique()
categories = list(stressor_terms.keys())
results = []
Explanation
Defines five stressor categories with associated keywords, similar to the previous notebook but focused on comparing how these stressors manifest across different subreddits. This allows analysis of whether certain communities discuss specific stressors more than others.
# This function counts the total number of whitespace-separated words
def count_total_words(text_series):
    if text_series.empty:
        return 0
    total_words = text_series.astype(str).str.split().str.len().sum()
    return total_words

# This function counts occurrences of a list of keywords
def count_keyword_mentions(text_series, keywords):
    if text_series.empty:
        return 0
    # Create a regex pattern: r'\b(word1|word2|word3)\b'
    pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
    mentions = text_series.astype(str).str.count(pattern, flags=re.IGNORECASE).sum()
    return int(mentions)
Explanation
count_total_words():
Counts total words in a series of texts by splitting on whitespace
count_keyword_mentions():
Uses a regex to count keyword occurrences with word boundaries (\b) to avoid partial matches. The re.IGNORECASE flag ensures case-insensitive matching.
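To see what the word boundaries buy, here is a small standalone sketch using only the standard `re` module. It builds the pattern the same way as count_keyword_mentions() but runs it on a single made-up sentence rather than a pandas Series.

```python
import re

keywords = ["work", "panic attack", "panic attacks"]
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)

# Made-up sentence for illustration.
text = "Work is hard. My coworker had a panic attack; I get panic attacks in the workplace."

matches = pattern.findall(text)
total_words = len(text.split())
freq_per_1000 = len(matches) / total_words * 1000

print(matches)        # ['Work', 'panic attack', 'panic attacks']
print(total_words)    # 16
print(freq_per_1000)  # 187.5
```

"coworker" and "workplace" are not counted because "work" is not bounded on both sides there, and "panic attacks" is counted once rather than also matching "panic attack", since regex matches do not overlap.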
for period in periods:
    for sub in subreddits:
        # Create the subset of data
        subset_df = reddit_sent_df[
            (reddit_sent_df["covid_period"] == period)
            & (reddit_sent_df["subreddit"] == sub)
        ]
        if subset_df.empty:
            continue
        # Get all text and total words for this subset
        text_data = subset_df["full_text"]
        total_words = count_total_words(text_data)
        if total_words == 0:
            continue
        # Calculate frequency for each category
        for category, keywords in stressor_terms.items():
            mentions = count_keyword_mentions(text_data, keywords)
            # Calculate frequency per 1000 words
            freq_per_1000 = (mentions / total_words) * 1000 if total_words > 0 else 0
            # Store the result
            results.append(
                {
                    "covid_period": period,
                    "subreddit": sub,
                    "category": category,
                    "frequency_per_1000": freq_per_1000,
                    "total_mentions": mentions,
                    "total_words": total_words,
                }
            )
Explanation
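This triple-nested loop iterates through each combination of period, subreddit, and stressor category. For each combination, it filters the posts to that period and subreddit, counts the total words, counts the keyword mentions for the category, normalizes the count to mentions per 1000 words, and appends the result to results.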
This creates a comprehensive dataset showing how often each stressor is mentioned in each subreddit during each period.
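The rows collected in results can be turned into a DataFrame for plotting or inspection. The sketch below uses two made-up rows in the same shape as the loop output and pivots them so each period becomes a column, which makes per-period values easy to read off. The variable `wide` is a hypothetical name.

```python
import pandas as pd

# Two hypothetical rows in the shape produced by the loop (values made up).
results = [
    {"covid_period": "Pre-COVID", "subreddit": "Anxiety", "category": "burnout",
     "frequency_per_1000": 2.0, "total_mentions": 20, "total_words": 10_000},
    {"covid_period": "Post-COVID", "subreddit": "Anxiety", "category": "burnout",
     "frequency_per_1000": 2.5, "total_mentions": 50, "total_words": 20_000},
]

keyword_freq_df = pd.DataFrame(results)

# One row per (category, subreddit), one column per period.
wide = keyword_freq_df.pivot_table(
    index=["category", "subreddit"],
    columns="covid_period",
    values="frequency_per_1000",
)
print(wide)
```

Keeping total_mentions and total_words alongside the rate is useful here: a rate built on very few words is much noisier than one built on millions.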
# Assumed step (not shown in the original): collect the loop results into a DataFrame
keyword_freq_df = pd.DataFrame(results)
subreddit_order = sorted(keyword_freq_df['subreddit'].unique())
covid_period_order = ["Pre-COVID", "During COVID", "Post-COVID"]
fig = px.bar(
    keyword_freq_df,
    x='subreddit',
    y='frequency_per_1000',
    color='covid_period',
    facet_col='category',
    facet_col_wrap=3,
    barmode='group',
    category_orders={
        'subreddit': subreddit_order,
        'covid_period': covid_period_order
    },
    title='Keyword Frequency by Subreddit, Period, and Category',
    labels={
        'subreddit': "Subreddit",
        "frequency_per_1000": "Frequency Per 1000 Words",
        'covid_period': "COVID Period"
    },
    color_discrete_sequence=px.colors.sequential.Darkmint_r,
    height=800,
    facet_row_spacing=0.15,
    facet_col_spacing=0.05
)
# Give each facet its own axis scale and keep tick labels visible on every panel
fig.update_xaxes(tickangle=45, matches=None, showticklabels=True)
fig.update_yaxes(matches=None, showticklabels=True)
# Shorten facet titles from "category=burnout" to "burnout"
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_layout(margin=dict(b=100))
fig.show()
Explanation
Creates a complex faceted bar chart with:
Facets: Separate panels for each stressor category (5 panels total)
X-axis: Subreddits (Anxiety, depression, mentalhealth)
Y-axis: Frequency per 1000 words
Color: COVID period (three shades showing temporal progression)
Layout: 3 columns per row, 800px height
Updates axes to allow independent scales per facet and rotates x-axis labels 45 degrees for readability
Health Anxiety (mentions per 1000 words)
r/depression: Pre-COVID ~2.0, During COVID ~2.3, Post-COVID ~2.2
r/mentalhealth: Pre-COVID ~4.8, During COVID ~3.5, Post-COVID ~3.6
r/Anxiety: Pre-COVID ~7.0, During COVID ~9.2, Post-COVID ~10.0
Interpretation: r/Anxiety dominates health anxiety language, which makes sense given the subreddit’s focus. Strikingly, r/Anxiety’s health anxiety mentions increased 43% from pre-COVID to post-COVID (7.0 → 10.0 per 1000 words), showing sustained physical symptom focus even after the pandemic. r/mentalhealth actually decreased during COVID, possibly because general wellness discussions were crowded out by more acute concerns. r/depression remained relatively stable, suggesting depression discussions focus less on physical anxiety symptoms.
Work Stress (mentions per 1000 words)
r/depression: All periods ~3.2-3.7
r/mentalhealth: Pre-COVID ~3.0, During COVID ~3.5, Post-COVID ~3.2
r/Anxiety: Pre-COVID ~3.0, During COVID ~5.8, Post-COVID ~4.2
Interpretation: r/Anxiety showed dramatic work stress increase during COVID (nearly doubling from 3.0 to 5.8), reflecting how work-from-home, job insecurity, and workplace changes particularly triggered anxiety. The post-COVID decline to 4.2 suggests partial adaptation but still 40% above baseline. r/depression and r/mentalhealth remained relatively stable, indicating work stress isn’t as central to depression/general mental health discussions.
School Stress (mentions per 1000 words)
r/depression: All periods ~3.8-4.0
r/mentalhealth: Pre-COVID ~3.5, During COVID ~3.4, Post-COVID ~3.3
r/Anxiety: Pre-COVID ~3.4, During COVID ~3.6, Post-COVID ~3.4
Interpretation: School stress showed remarkable stability across all subreddits and periods, hovering around 3.3-4.0 per 1000 words. This suggests academic stress is a constant background factor in these communities, relatively unaffected by COVID. The slight elevation in r/depression might reflect the higher prevalence of depression among students dealing with academic pressures.
Burnout (mentions per 1000 words)
r/Anxiety: Pre-COVID ~2.2, During COVID ~2.2, Post-COVID ~2.2
r/depression: Pre-COVID ~4.7, During COVID ~5.0, Post-COVID ~7.6
r/mentalhealth: Pre-COVID ~2.9, During COVID ~3.0, Post-COVID ~3.0
Interpretation: This is the most striking panel. r/depression’s burnout language rose sharply post-COVID, increasing 62% from pre-COVID (4.7 → 7.6 per 1000 words). During COVID it was only slightly elevated (5.0), but the post-COVID period saw a large increase in words like “tired,” “anymore,” “hate,” and “end.” This suggests depression communities are experiencing severe exhaustion and possibly increased suicidal ideation in the aftermath of COVID.
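Because terms such as “end” and “anymore” are also common in everyday language, the category total alone does not show which words drive the rise. One way to check is a per-keyword breakdown. The sketch below is a suggested diagnostic, not part of the original analysis; the function name and the sample posts are hypothetical.

```python
import re
from collections import Counter


def keyword_breakdown(texts, keywords):
    """Count whole-word, case-insensitive matches per keyword."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        flags=re.IGNORECASE,
    )
    counts = Counter()
    for text in texts:
        counts.update(m.lower() for m in pattern.findall(text))
    return counts


# Made-up posts for illustration only.
posts = [
    "I'm so tired. I can't do this anymore.",
    "By the end of the week I hate everything.",
    "The weekend was fine in the end.",
]

breakdown = keyword_breakdown(posts, ["tired", "anymore", "hate", "end"])
print(breakdown.most_common())
```

In this toy sample “end” is the most frequent term even though neither use signals burnout, which is the kind of ambiguity a per-keyword view can surface before drawing conclusions from the category total.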
Paradoxically, r/Anxiety’s burnout remained completely flat (~2.2 across all periods), suggesting anxiety manifests more as acute distress rather than chronic exhaustion. r/mentalhealth also remained stable, possibly because it’s a more solutions-focused community.
Therapy (mentions per 1000 words)
r/Anxiety: Pre-COVID ~0.95, During COVID ~0.7, Post-COVID ~1.0
r/depression: Pre-COVID ~0.7, During COVID ~0.9, Post-COVID ~0.9
r/mentalhealth: Pre-COVID ~1.2, During COVID ~1.4, Post-COVID ~1.35
Interpretation: r/mentalhealth consistently discusses therapy most (1.2-1.4 per 1000 words), reinforcing its role as a resource-oriented community. The slight increase during COVID likely reflects telehealth discussions.
r/Anxiety shows a U-shaped pattern: therapy mentions dropped during COVID (0.95 → 0.7), possibly because acute crisis posts crowded out treatment discussions, then rebounded post-COVID to baseline levels.
r/depression showed modest increases, suggesting growing treatment-seeking behavior. Overall, therapy language remains relatively low across all communities (under 1.5 per 1000 words), which might indicate barriers to treatment access or stigma.
Each community has distinct concerns, validating their separate existence and suggesting tailored interventions would be more effective than one-size-fits-all approaches.
r/Anxiety showed the sharpest changes during COVID across multiple categories (work stress doubled, health anxiety spiked), while r/depression remained more stable during the acute phase but worsened dramatically afterward. This suggests:
The sentiment chart shows both communities worsened post-COVID, but the burnout data reveals r/depression experienced particularly severe deterioration. The 62% increase in burnout language suggests:
This community consistently shows:
This suggests general mental health communities may provide more balanced support than diagnosis-specific spaces, though both serve important roles.
The progression from acute anxiety during COVID to severe burnout/depression post-COVID across the overall dataset might represent a temporal mental health cascade:
This aligns with research showing chronic anxiety can lead to depression when stressors persist without resolution.
Synthesizing all four notebooks:
Volume explosion: Mental health discussions increased 18x during COVID and remain 10x elevated
Sentiment deterioration: All communities experienced worsening sentiment, with continued decline post-COVID (especially r/Anxiety at -0.30)
Stressor evolution:
Community-specific impacts:
Public health implications:
This comprehensive analysis reveals COVID-19 triggered a profound, multifaceted, and persistent mental health crisis that varies by community but universally shows no signs of resolution. The data suggests we’re experiencing the mental health consequences now, years after the acute pandemic phase.